#memory-management

#pagedattention

Evaluating vLLM With Basic Sampling | HackerNoon

vLLM sustains higher request rates than other serving systems while keeping latency low, thanks to efficient memory management.

How vLLM Implements Decoding Algorithms | HackerNoon

vLLM optimizes large language model serving through innovative memory management and GPU techniques.

PagedAttention: An Attention Algorithm Inspired By the Classical Virtual Memory in Operating Systems | HackerNoon

PagedAttention optimizes memory usage in language model serving, significantly improving throughput while minimizing KV cache waste.

How Good Is PagedAttention at Memory Sharing? | HackerNoon

Memory sharing in PagedAttention enhances efficiency in LLMs, significantly reducing memory usage during sampling and decoding processes.
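
The sharing idea in the summary above can be sketched with reference counts and copy-on-write. This is a minimal illustration, not vLLM's code: the class `SharedBlocks` and its methods are hypothetical names, and block contents are left out so that only the bookkeeping is visible.

```python
class SharedBlocks:
    """Toy reference-counted block pool (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self.ref_count = {}  # physical block id -> number of sequences using it
        self._next_id = 0

    def allocate(self) -> int:
        block_id = self._next_id
        self._next_id += 1
        self.ref_count[block_id] = 1
        return block_id

    def fork(self, block_table):
        # A forked sequence (e.g. a second sample of the same prompt)
        # reuses every block of its parent instead of copying them.
        for block_id in block_table:
            self.ref_count[block_id] += 1
        return list(block_table)

    def write(self, block_table, index) -> int:
        # Copy-on-write: a shared block is replaced by a private one
        # before this sequence writes to it. A real system would also
        # copy the block's contents; this sketch tracks ids only.
        block_id = block_table[index]
        if self.ref_count[block_id] > 1:
            self.ref_count[block_id] -= 1
            block_table[index] = self.allocate()
        return block_table[index]


pool = SharedBlocks()
parent = [pool.allocate(), pool.allocate()]
child = pool.fork(parent)   # shares both blocks with the parent
pool.write(child, 1)        # child gets a private copy of its last block
```

After the write, the first block is still shared by both sequences, and only the block that diverged costs extra memory.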

Our Method for Developing PagedAttention | HackerNoon

PagedAttention optimizes memory usage in LLM serving by managing key-value pairs in a non-contiguous manner.
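
A minimal sketch of what non-contiguous management can look like: each sequence keeps a block table that maps its logical positions to whichever physical blocks happen to be free. The names `BlockAllocator`, `Sequence`, and `BLOCK_SIZE` are assumptions made for this example and are not taken from vLLM.

```python
from __future__ import annotations

from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per KV block (illustrative value)


@dataclass
class BlockAllocator:
    """Hands out physical block ids from a fixed pool."""
    num_blocks: int
    free: list[int] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Stored in reverse so that pop() hands out 0, 1, 2, ...
        self.free = list(range(self.num_blocks - 1, -1, -1))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks")
        return self.free.pop()


@dataclass
class Sequence:
    """Maps a sequence's logical token positions to physical blocks."""
    allocator: BlockAllocator
    block_table: list[int] = field(default_factory=list)
    num_tokens: int = 0

    def append_token(self) -> tuple[int, int]:
        # A new physical block is requested only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        slot = (self.block_table[-1], self.num_tokens % BLOCK_SIZE)
        self.num_tokens += 1
        return slot


allocator = BlockAllocator(num_blocks=8)
a, b = Sequence(allocator), Sequence(allocator)
for _ in range(5):
    a.append_token()
    b.append_token()
print(a.block_table, b.block_table)  # the two tables interleave physical blocks
```

Because blocks are claimed on demand, neither sequence reserves memory for tokens it has not generated yet.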

Evaluating vLLM's Design Choices With Ablation Experiments | HackerNoon

PagedAttention significantly enhances vLLM's performance despite adding overhead, illustrating the trade-offs in optimizing GPU operations for large language models.

#large-language-models

Practical LLMs for Real-World Applications | HackerNoon

Anchor-based LLMs reduce memory use by 99% while improving inference speed by up to 3.5 times, enabling practical use on resource-constrained devices.

How Effective is vLLM When a Prefix Is Thrown Into the Mix? | HackerNoon

vLLM significantly improves throughput in LLM tasks by utilizing shared prefixes among different input prompts.

#transformer-models

Evaluating the Performance of vLLM: How Did It Do? | HackerNoon

vLLM was tested using various Transformer-based large language models to evaluate its performance under load.

The Generation and Serving Procedures of Typical LLMs: A Quick Explanation | HackerNoon

Transformer-based language models use autoregressive approaches for token sequence probability modeling.

Batching Techniques for LLMs | HackerNoon

Batching improves compute utilization for LLMs, but naive strategies can cause delays and waste resources. Fine-grained batching techniques offer a solution.

#vllm

The Distributed Execution of vLLM | HackerNoon

Large Language Models often exceed single GPU limits, requiring advanced distributed execution techniques for memory management.

How vLLM Prioritizes a Subset of Requests | HackerNoon

vLLM utilizes FCFS scheduling and an all-or-nothing eviction policy to effectively manage resources and prioritize fairness in request handling.
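
One way to picture the policy in the summary above is the toy scheduler below. It is a sketch under stated assumptions rather than vLLM's scheduler: `FcfsScheduler` is a hypothetical class, each request declares its block need up front, the most recently admitted request is chosen as the eviction victim, and an evicted request rejoins the front of the waiting queue.

```python
from collections import deque


class FcfsScheduler:
    """Toy FCFS admission with all-or-nothing eviction (illustrative only)."""

    def __init__(self, total_blocks: int) -> None:
        self.free_blocks = total_blocks
        self.waiting = deque()  # (request id, blocks needed), oldest first
        self.running = []       # admitted requests, in admission order

    def submit(self, request_id: str, blocks_needed: int) -> None:
        self.waiting.append((request_id, blocks_needed))

    def admit(self) -> None:
        # Strict FCFS: stop at the first request that does not fit,
        # even if a later, smaller request would.
        while self.waiting and self.waiting[0][1] <= self.free_blocks:
            request = self.waiting.popleft()
            self.free_blocks -= request[1]
            self.running.append(request)

    def evict_latest(self) -> str:
        # All-or-nothing: every block of the victim is released together.
        # Assumption: the victim goes back to the front of the queue.
        request_id, blocks = self.running.pop()
        self.free_blocks += blocks
        self.waiting.appendleft((request_id, blocks))
        return request_id


scheduler = FcfsScheduler(total_blocks=10)
for name in ("a", "b", "c"):
    scheduler.submit(name, 4)
scheduler.admit()         # "a" and "b" fit; "c" waits its turn
scheduler.evict_latest()  # frees all four blocks held by "b"
```

Evicting a whole request at once avoids leaving a sequence with only part of its cache resident, which would be unusable for the next decoding step.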

#deep-learning

PyTorch Contiguous Tensor Optimization | HackerNoon

Efficient memory management and tensor contiguity are essential for optimizing performance in PyTorch, especially when handling large-scale datasets.

PagedAttention: Memory Management in Existing Systems | HackerNoon

Current LLM serving systems inefficiently manage memory, resulting in significant waste due to fixed size allocations based on potential maximum sequence lengths.

from InfoQ
3 weeks ago

Challenges of Creating iOS App Extensions at Lyft

Lyft engineers efficiently manage iOS app extension development by optimizing dependencies, binary size, and memory usage while adhering to Apple's constraints.
#app-compatibility

Android 15 Beta 4 Now Available for Developers to Bring their Apps Up to Date

The final Android 15 beta focuses on stable APIs for developer use and introduces significant behavior changes and new privacy features.

Google urges Android developers to prep for 16 KB memory page

Android developers must prepare for a 16 KB memory page size upgrade to gain performance benefits of 5-10% in apps and games.

#scala

How We Saved 12% in Resources with Smarter Heap Management

Memory issues in a Scala service were traced to underutilized buffers in Netty and inefficient memory management by the JSON library Jsoniter, which led to high memory usage and frequent GC runs.
The team optimized buffer sizes and investigated long-lived object handling to address these issues and prevent Out of Memory errors.

LLM Service & Autoregressive Generation: What This Means | HackerNoon

LLMs generate tokens sequentially, relying on cached key and value vectors from prior tokens for efficient autoregressive generation.
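
To make the role of the cache concrete, here is a toy single-head attention step in plain Python. It is a sketch only: `attend` and `KVCache` are hypothetical names, vectors are plain lists, and projections, multiple heads, and batching are all left out.

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention of one query over all cached positions."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(score - peak) for score in scores]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * value[i] for w, value in zip(weights, values)) / total
        for i in range(dim)
    ]


class KVCache:
    """Stores each token's key and value so they are computed only once."""

    def __init__(self) -> None:
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        # Only the new token's key and value are added; earlier tokens
        # are read back from the cache instead of being recomputed.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)


cache = KVCache()
first = cache.step([1.0, 0.0], [1.0, 0.0], [2.0, 3.0])
second = cache.step([0.0, 1.0], [0.0, 1.0], [4.0, 5.0])
```

The cache grows by one key and one value per generated token, which is why its memory footprint depends on sequence length.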
#rust

Handling memory leaks in Rust - LogRocket Blog

Rust's ownership and borrowing principles help manage memory but memory leaks can still occur, necessitating careful management by developers.

A Simplified Comparison: Rust and Pointers | HackerNoon

Rust ensures memory safety through its unique ownership and borrowing model, mitigating risks present in traditional languages.

#java

Java 24 to Reduce Object Header Size and Save Memory

JEP 450 optimizes Java heap management by implementing compact object headers, which reduces header size and improves memory efficiency.

Java stack and heap definitions

Java stack holds local variables and partial results, while Java heap is where memory for class instances and arrays is allocated.

Chatbot Memory: Implement Your Own Algorithm From Scratch | HackerNoon

Implementing effective memory management is crucial for chatbot development, ensuring fluid and coherent interactions during long conversations.
from The Verge
2 months ago

Chrome introduces new 'Performance' tools to wrangle the tabs gobbling up your memory

Google introduces new Chrome features for better tab management through performance alerts and enhanced Memory Saver options.

An Efficient C++ Fixed Block Memory Allocator

Custom fixed block memory allocators improve memory management efficiency and reduce fragmentation issues, enhancing performance in critical and long-running systems.
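
The fixed-block idea can be sketched in a few lines. The linked article is about C++; the version below is a Python illustration with a hypothetical `FixedBlockPool` class, meant to show the free-list mechanism rather than the article's implementation.

```python
class FixedBlockPool:
    """Pool of equally sized blocks carved out of one up-front buffer."""

    def __init__(self, block_size: int, num_blocks: int) -> None:
        self.block_size = block_size
        self._buffer = bytearray(block_size * num_blocks)
        # Stack of free block indices; pop() hands out 0, 1, 2, ...
        self._free = list(range(num_blocks - 1, -1, -1))

    def allocate(self) -> int:
        if not self._free:
            raise MemoryError("pool exhausted")
        return self._free.pop()

    def free(self, index: int) -> None:
        # The block returns to the free list and is reused as-is.
        self._free.append(index)

    def view(self, index: int) -> memoryview:
        # Writable window onto the block's bytes, without copying.
        start = index * self.block_size
        return memoryview(self._buffer)[start:start + self.block_size]


pool = FixedBlockPool(block_size=16, num_blocks=2)
first = pool.allocate()
second = pool.allocate()
pool.view(second)[:2] = b"hi"
pool.free(first)
reused = pool.allocate()  # the freed block is handed out again
```

Since every block has the same size, any freed block can satisfy any later request, so the pool cannot fragment the way a general-purpose heap can.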

Plumbing Life's Depths - Interesting Memory Leak with Python 3.12 for PyOpenGL-accelerate

Under Python 3.12, the PyOpenGL-accelerate test suite's memory leak test fails uniquely on the 76th iteration, suggesting changes in Python's memory management.
#performance-optimization

Improve the performance of your Java application by using these optimizations

Optimize string concatenation with StringBuilder or StringBuffer.
Use local variables for frequently accessed data.
Optimize loops by moving invariant calculations outside.
Use switch statements for better performance.

Understanding Spark Re-Partition

Spark's repartition() function is crucial for managing data skewness, optimizing performance, memory utilization, and downstream query efficiency.

#data-structures

Augmented Linked Lists: An Essential Guide | HackerNoon

Linked lists allow fast addition of data without resizing an entire array, which makes them suitable for write-only data and for organizing data for sequential reads.

Augmented Tree Data Structures | HackerNoon

Data structures are key to efficient data storage and organization, crucial for memory management and optimizing software performance.

#garbage-collection

ECMAScript proposal: Symbols as WeakMap keys

Symbols as WeakMap keys allow non-mutating attachment of data, preventing memory leaks.

The worst developer nightmare [Memory leak]

Memory leaks can occur in various programming languages, from manually managed to automatic memory systems.
Common causes of memory leaks include unintentional object retention, circular references, unclosed resources, event listeners, caching without expiration, and poor memory profiling.

Optimizing Resource Allocation and Parallel Processing for 20GB Spark Jobs

Optimizing resource allocation based on data volume and processing speed is crucial for efficient job completion.

htcw_json: A tiny streaming JSON parser

Introduces an efficient JSON parsing solution built on a 'pull' style parser.
It can chunk values longer than the buffer size and handles basic data types such as integers, real numbers, and booleans.

CISA Report Finds Most Open-Source Projects Contain Memory-Unsafe Code

More than half of critical open-source projects contain memory-unsafe code, leading to vulnerabilities like buffer overflows and memory leaks.

Kubernetes 1.30 Released with Contextual Logging, Improved Performance, and Security

Kubernetes 1.30 introduces improvements like memory swap support, sleep action for PreStop hook, and CEL for admission control.
Enhancements include beta support for user namespaces, more secure service account tokens, and Contextual Logging for better troubleshooting.
Scheduling improvements in 1.30 feature MatchLabelKeys for PodAffinity and PodAntiAffinity, enhancing pod placement strategies.

Linux free memory: How to show the free memory on a Linux system

You can show free memory on a Linux system using the free command with different parameters like -m for MB.
The top utility provides a real-time view of memory use by running applications, and the Linux ps command can be sorted by memory use.

How to control Java heap size (memory) allocation (xmx, xms)

Use -Xmx to specify maximum heap size and control RAM use in Java programs.