Unlocking the Power of Language: Leveraging Large Language Models for Next-Gen Semantic Search and Real-World Applications
Invited talk at Calfus, Pune, June 20, 2024.
I acquired an EC2 instance on Amazon’s cloud for building Generative AI models, gaining access to a Tesla T4 GPU and ample GPU RAM for deploying a Large Language Model (LLM) for an app developed for an investment banking firm. Access to the EC2 instance is primarily through the terminal on my MacBook, which starkly contrasts with the familiar Visual Studio Code (VS Code) environment that offers convenient integrations with GitHub and Docker.
While the Command Line Interface (CLI) is a valuable tool for developers, it can be cumbersome for tasks that benefit from a more visual interface. For quick changes to code moved from development to production environments, the EC2 instance running Ubuntu 22.04 OS offered only the Nano editor. Though Nano is a perfectly good editor, I wanted something closer to VS Code. The solution? A light version of the Theia IDE in Docker, which perfectly met my needs.
To access the EC2 instance, I SSH into it from an open terminal on my MacBook, using a downloaded .pem file for authentication. The command is as follows:
ssh -i ~/.ssh/secret.pem deusexmachina@45.245.72.29
A dockerized app typically listens for incoming requests at the host IP address on a specific port. When the app has a web UI, it can be accessed in a web browser using the IP address and port number, allowing access from any location on the web that can reach the IP address.
For our EC2 instance, we can find the required public IP address using the AWS CLI. As for the port used by the app, we must ensure it is exposed to the public internet. We will use the AWS CLI to check which ports are exposed and, if the required port is not among them, to open it. Ensure you have the AWS CLI installed on your computer; if not, follow the installation instructions in the AWS documentation.
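Before running the checks below, a quick way to confirm the CLI is installed and has credentials configured (the configure prompts ask for your access key, secret key, default region, and output format):

# Verify the AWS CLI is on the PATH
aws --version
# Configure credentials and a default region if you haven't already
aws configure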
aws ec2 describe-security-groups --group-ids sg-08x6bdf6x665XcxeX | \
jq '.SecurityGroups[0].IpPermissions[] | select(.IpRanges[].CidrIp == "0.0.0.0/0")'
The command retrieves information about a specific security group in AWS EC2 and uses `jq` to filter the results. Let's break down each part of the command:
AWS CLI Command
aws ec2 describe-security-groups --group-ids sg-08x6bdf6x665XcxeX
- `aws ec2 describe-security-groups`: This is an AWS CLI command used to retrieve details about one or more security groups.
- `--group-ids sg-08x6bdf6x665XcxeX`: This option specifies the security group ID for which details are being requested. In this case, the security group ID is sg-08x6bdf6x665XcxeX.

This command outputs information about the specified security group in JSON format.
Piping to jq
| jq '.SecurityGroups[0].IpPermissions[] | select(.IpRanges[].CidrIp == "0.0.0.0/0")'
- `| jq`: This pipe sends the output of the `aws ec2 describe-security-groups` command to `jq`, which is a command-line JSON processor.
- `.SecurityGroups[0]`: This selects the first (and likely only) security group from the list of security groups returned by the AWS CLI command.
- `.IpPermissions[]`: This iterates over each IP permission rule within the selected security group. Each security group can have multiple IP permission rules, which specify which IP ranges are allowed or denied access.
- `select(.IpRanges[].CidrIp == "0.0.0.0/0")`: This `select` function filters the IP permission rules to find those that have an IP range (CIDR block) of 0.0.0.0/0. The 0.0.0.0/0 CIDR block represents the entire IPv4 address space, indicating a rule that applies to all IP addresses.

Combining these parts, the full command retrieves information about the security group sg-08x6bdf6x665XcxeX using the AWS CLI, then processes the JSON output with `jq` to find the permission rules that are open to all IP addresses (0.0.0.0/0).

Here is the JSON indicative of a rule allowing TCP traffic on port 22 (typically used for SSH) from any IP address.
{
"IpProtocol": "tcp",
"FromPort": 22,
"ToPort": 22,
"IpRanges": [
{
"CidrIp": "0.0.0.0/0"
}
]
}
What if the port we need is not open? Here is the command to open a port.
aws ec2 authorize-security-group-ingress --group-id sg-08x6bdf6x665XcxeX \
--protocol tcp --port 8081 --cidr 0.0.0.0/0
The `aws ec2 authorize-security-group-ingress` command adds an ingress rule to a specified security group. An ingress rule specifies the type of incoming traffic that is allowed to reach instances associated with the security group. We have used the command with options and arguments as follows:
- `--group-id sg-08x6bdf6x665XcxeX`: This option specifies the ID of the security group to which you want to add the ingress rule. In this case, the security group ID is sg-08x6bdf6x665XcxeX.
- `--protocol tcp`: This option specifies the protocol for the rule. Here, tcp indicates that the rule applies to the TCP protocol.
- `--port 8081`: This option specifies the port number to which the rule applies. In this case, it is port 8081.
- `--cidr 0.0.0.0/0`: This option specifies the CIDR block (range of IP addresses) that is allowed by this rule. 0.0.0.0/0 represents all IP addresses, meaning the rule allows incoming traffic from any IP address.

Putting it all together, the command adds an ingress rule to security group sg-08x6bdf6x665XcxeX allowing TCP traffic on port 8081 from any IP address (0.0.0.0/0). By running this command, we modify the specified security group to allow incoming TCP traffic on port 8081 from any IP address. In this way, we allow access to an application listening on port 8081 to users from anywhere on the internet.
Note on Security:
Allowing traffic from 0.0.0.0/0 means that the port is open to the entire internet. This can be a significant security risk. It is generally advisable to restrict the CIDR range to specific IP addresses or ranges that require access, rather than opening it up to all addresses.
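If you later want to tighten things up, the rule can be removed or replaced with a narrower CIDR using the corresponding revoke command. Here is a sketch using the same example values, with 203.0.113.0/24 standing in for your own address range:

# Remove the wide-open rule on port 8081
aws ec2 revoke-security-group-ingress --group-id sg-08x6bdf6x665XcxeX \
    --protocol tcp --port 8081 --cidr 0.0.0.0/0
# Re-add it restricted to a specific range (example range shown)
aws ec2 authorize-security-group-ingress --group-id sg-08x6bdf6x665XcxeX \
    --protocol tcp --port 8081 --cidr 203.0.113.0/24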
Theia Use Case:
The Theia IDE runs as a web application that listens on port 3000. By running the command `aws ec2 describe-security-groups --group-ids sg-08x6bdf6x665XcxeX | jq '.SecurityGroups[0].IpPermissions[] | select(.IpRanges[].CidrIp == "0.0.0.0/0").FromPort'`, I found that port 3000 was among those open to TCP traffic. However, it was already in use by the web application Grafana. To resolve this, I decided to map port 3000 on Theia's Docker container to port 3030 on the EC2 instance. I then ran the command to modify the security group, allowing HTTP (or any TCP) traffic to reach Theia on port 3030 from any IP address.
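To see what was already bound to port 3000 on the instance before picking an alternative, either of these quick checks works:

# List listening TCP sockets and the owning process
sudo ss -tlnp | grep ':3000'
# Or check which container publishes the port
docker ps --format '{{.Names}}\t{{.Ports}}' | grep 3000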
Let’s look at the installation process next.
Here are the steps to install Theia:
docker pull t1m0thyj/theia-alpine
docker run -d --init \
--name theia_service \
--restart always \
-p 3030:3000 \
-v /home/ubuntu:/home/project t1m0thyj/theia-alpine
The `docker run` command creates a container named theia_service and maps port 3030 on the EC2 instance to port 3000 on the container. The image t1m0thyj/theia-alpine is a lightweight version of the Theia IDE. With the container up and running, as verified by `docker ps`, we can access Theia using the URL http://45.245.72.29:3030 in a web browser. Here is what that looks like in Safari on my MacBook.
Although the dockerized Theia running on EC2 was accessible this way, it did not allow creating or modifying files. In other words, it functioned as a read-only IDE!
Using Docker to install the Theia IDE on an AWS EC2 instance involves ensuring the container has the necessary permissions to create or update text files on the shared volume. The directory permissions must be correctly set, and it may be necessary to create a user on the container with the same UID/GID as the user on the host. Looking into the Docker logs with the command `docker logs theia_service` on the EC2 machine, it became apparent that the container did not have sufficient permissions on the shared volume.
Ensure that the shared folder between the EC2 host and docker container has the correct ownership and permissions with the following commands:
sudo chown -R ubuntu:ubuntu /home/ubuntu
sudo chmod -R 775 /home/ubuntu
Here, the logged-in user on the EC2 instance is named ubuntu. This can be verified by running the `whoami` command at the terminal prompt. Alternatively, use the `$USER` variable in the command, like so: `sudo chown -R $USER:$USER /home/ubuntu`.
This alone wasn't enough to fix the issue. Running the `whoami` command inside the Docker container from an interactive shell revealed the container user to be theia.
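The same check can be run non-interactively; a minimal sketch (assuming the busybox whoami applet is present in the image, as it typically is on Alpine):

# Print the user that the Theia container runs as
docker exec theia_service whoami
# Expected output on this image: theia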
The fix to the permissions error was to ensure that the container user theia has the same permissions as the instance user ubuntu. This is possible with a small modification to the options and arguments passed to the `docker run` command, as follows:
docker run --rm --init \
--name theia_service_temp \
-d -p 3030:3000 \
-v /home/ubuntu/:/home/project \
-u $(id -u ubuntu):$(id -g ubuntu) \
t1m0thyj/theia-alpine
This command spins up a temporary container (note the `--rm` flag) for testing the approach. The container is removed upon exit.
A file named tempfile was created using the IDE with the content "Namaste ji!". The change was verified using `cat /home/ubuntu/tempfile` on the host instance. Then the contents of the file were modified and the changes verified again. With this validation of the approach to fix the permissions error, the final `docker run` command was as follows:
docker run --init \
--name theia_service \
--restart always \
-d -p 3030:3000 \
-v /home/ubuntu/:/home/project \
-u $(id -u ubuntu):$(id -g ubuntu) \
t1m0thyj/theia-alpine
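With the permanent container running, a quick sanity check of write access on the shared volume is a round trip through both sides of the bind mount (the file name here is arbitrary):

# Create a file as the container user, then inspect it from the host
docker exec theia_service touch /home/project/write_test.txt
ls -l /home/ubuntu/write_test.txt   # should show ubuntu:ubuntu ownership
rm /home/ubuntu/write_test.txt      # clean up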
Going headless doesn’t mean having to forgo an IDE. With Docker, the installation process can be painless and fruitful.
Join our FastAI class to learn from experienced instructors who excel not only in building state-of-the-art models but also in tackling complex engineering challenges in real-world scenarios. Our instructors bring a wealth of expertise from successfully deploying applications that solve tangible problems and reach customers effectively. Through hands-on projects and personalized guidance, you’ll gain practical skills in AI and machine learning, ensuring you’re equipped to create impactful solutions in your career. Whether you’re aiming to advance your knowledge or pivot into a new field, our FastAI class offers the perfect blend of theoretical foundations and real-world application to help you succeed.
SQL is sometimes dismissed as a primitive programming language. Yet its vocabulary, while compact, can work wonders to shape data into a suitable form for visualization. One has only to be creative! Here is an example.
I have an app that records daily expenses. The Postgres DB saves each expense as a line item with the amount and date-time. Further, the expense is annotated with tags in key-value form to describe WHY (purpose of the expense), WHO (vendor engaged), WHERE (transaction location), etc.
I wanted two visualizations, as described below, built on the following DB schema for expense tracking.
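The exact DDL isn't reproduced here; this is a minimal sketch of the schema implied by the queries that follow (column types are assumptions):

-- One row per expense
CREATE TABLE item (
    id           SERIAL PRIMARY KEY,
    amount_in_rs NUMERIC(12, 2) NOT NULL,
    dt_incurred  TIMESTAMP NOT NULL
);

-- Key-value annotations, e.g. name = 'purpose', description = 'groceries'
CREATE TABLE tag (
    id          SERIAL PRIMARY KEY,
    name        TEXT NOT NULL,
    description TEXT NOT NULL
);

-- Many-to-many link between expenses and tags
CREATE TABLE item_tag (
    item_id INTEGER REFERENCES item (id),
    tag_id  INTEGER REFERENCES tag (id),
    PRIMARY KEY (item_id, tag_id)
);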
Cumulative Expenses:
The visualization requires a table with 31 rows for the days of the month and a column with cumulative expenses for current month and another column with the same data for the previous month.
Categorical Expenses:
The visualization requires a table with a row for each month to be shown (2 months in the illustration – March, April) and one column for each category in which expenses are summarized.
Here is the query for the expense summaries by category for each month.
SELECT EXTRACT(YEAR FROM i.dt_incurred) AS year,
EXTRACT(MONTH FROM i.dt_incurred) AS month,
SUM(CASE WHEN t.description = 'groceries' THEN i.amount_in_rs ELSE 0 END) AS groceries,
SUM(CASE WHEN t.description = 'restaurant' THEN i.amount_in_rs ELSE 0 END) AS restaurant,
SUM(CASE WHEN t.description IN ('electricity', 'water') THEN i.amount_in_rs ELSE 0 END) AS utilities,
SUM(CASE WHEN t.description IN ('internet', 'mobile phone', 'cable TV') THEN i.amount_in_rs ELSE 0 END) AS communication,
SUM(CASE WHEN t.description LIKE '%services%' THEN i.amount_in_rs ELSE 0 END) AS services
FROM item i
JOIN item_tag it ON i.id = it.item_id
JOIN tag t ON it.tag_id = t.id
WHERE t.name = 'purpose'
AND EXTRACT(YEAR FROM i.dt_incurred) = EXTRACT(YEAR FROM CURRENT_DATE)
GROUP BY EXTRACT(YEAR FROM i.dt_incurred), EXTRACT(MONTH FROM i.dt_incurred)
ORDER BY EXTRACT(YEAR FROM i.dt_incurred), EXTRACT(MONTH FROM i.dt_incurred);
This produces the desired table as shown in the figure below.
We can understand this query with a simpler version that takes only one category to summarize expenses in the current month.
SELECT SUM(i.amount_in_rs) AS total_amount
FROM item i
JOIN item_tag it ON i.id = it.item_id
JOIN tag t ON it.tag_id = t.id
WHERE t.name = 'purpose'
AND t.description = 'clothes'
AND EXTRACT(MONTH FROM i.dt_incurred) = EXTRACT(MONTH FROM CURRENT_DATE)
AND EXTRACT(YEAR FROM i.dt_incurred) = EXTRACT(YEAR FROM CURRENT_DATE);
The code allows only the current month and year and then selects rows tagged with ‘purpose’ as ‘clothes’. In the final version, we work off the same JOIN and add heft to the query to generate summaries in multiple categories of interest for each month. This is a variation on the Split-Apply-Combine strategy for Data Analysis, where you break up a big problem into manageable pieces, operate on each piece independently and then put all the pieces back together.
The main query builds on this to generate the categorical summaries of each month in the current year that data are available for!
Here is the query for cumulative sum.
WITH days AS (
SELECT generate_series(1, 31) AS day_of_month
),
current_month_expenses_by_day AS (
SELECT
EXTRACT(DAY FROM dt_incurred) AS day_of_month,
SUM(amount_in_rs) AS daily_expense
FROM item
WHERE EXTRACT(YEAR FROM dt_incurred) = EXTRACT(YEAR FROM CURRENT_DATE)
AND EXTRACT(MONTH FROM dt_incurred) = EXTRACT(MONTH FROM CURRENT_DATE)
GROUP BY EXTRACT(DAY FROM dt_incurred)
),
previous_month_expenses_by_day AS (
SELECT
EXTRACT(DAY FROM dt_incurred) AS day_of_month,
SUM(amount_in_rs) AS daily_expense
FROM item
WHERE EXTRACT(YEAR FROM dt_incurred) = EXTRACT(YEAR FROM CURRENT_DATE - INTERVAL '1 month')
AND EXTRACT(MONTH FROM dt_incurred) = EXTRACT(MONTH FROM CURRENT_DATE - INTERVAL '1 month')
GROUP BY EXTRACT(DAY FROM dt_incurred)
),
current_month_cumulative_expenses AS (
SELECT
d.day_of_month,
COALESCE(e.daily_expense, 0) AS daily_expense
FROM
days d
LEFT JOIN current_month_expenses_by_day e ON d.day_of_month = e.day_of_month
),
previous_month_cumulative_expenses AS (
SELECT
d.day_of_month,
COALESCE(e.daily_expense, 0) AS daily_expense
FROM
days d
LEFT JOIN previous_month_expenses_by_day e ON d.day_of_month = e.day_of_month
)
SELECT
d.day_of_month,
SUM(c.daily_expense) OVER (ORDER BY d.day_of_month) AS cum_this_month,
SUM(p.daily_expense) OVER (ORDER BY d.day_of_month) AS cum_prev_month
FROM
days d
LEFT JOIN current_month_cumulative_expenses c ON d.day_of_month = c.day_of_month
LEFT JOIN previous_month_cumulative_expenses p ON d.day_of_month = p.day_of_month
ORDER BY
d.day_of_month;
It is easier to understand cumulative summation in SQL with a simpler example.
Example to Illustrate:
Assume we have the following table @t:
id | SomeNumt |
---|---|
1 | 10 |
2 | 20 |
3 | 30 |
4 | 40 |
The query to add a column containing the cumulative sum is as follows:
select
t1.id,
t1.SomeNumt,
SUM(t2.SomeNumt) as sum
from
@t t1
inner join
@t t2
on t1.id >= t2.id
group by
t1.id,
t1.SomeNumt
order by
t1.id;
The table `@t` is given aliases `t1` and `t2` to allow for a self-join. This is necessary to compare rows within the same table.

`inner join @t t2 on t1.id >= t2.id`: The condition `t1.id >= t2.id` ensures that each row in `t1` will match with all rows in `t2` where `t2.id` is less than or equal to `t1.id`. This effectively creates pairs of rows where each row in `t1` pairs with all preceding rows in `t2`, including itself.

Join Result:
The self-join on `id` with the condition `t1.id >= t2.id` will produce:
t1.id | t1.SomeNumt | t2.id | t2.SomeNumt |
---|---|---|---|
1 | 10 | 1 | 10 |
2 | 20 | 1 | 10 |
2 | 20 | 2 | 20 |
3 | 30 | 1 | 10 |
3 | 30 | 2 | 20 |
3 | 30 | 3 | 30 |
4 | 40 | 1 | 10 |
4 | 40 | 2 | 20 |
4 | 40 | 3 | 30 |
4 | 40 | 4 | 40 |
`group by t1.id, t1.SomeNumt`: The results are grouped by `t1.id` and `t1.SomeNumt`. This ensures that we calculate the cumulative sum for each `id` in `t1`.

`SUM(t2.SomeNumt) as sum`: For each group (that is, each `id` from `t1`), the query calculates the sum of `SomeNumt` from all the rows in `t2` whose `id` is less than or equal to the current `id` in `t1`. This effectively gives us the cumulative sum up to the current `id`.

Grouping and Summing Result:
Grouping by `t1.id` and `t1.SomeNumt` and summing `t2.SomeNumt`:
t1.id | t1.SomeNumt | SUM(t2.SomeNumt) |
---|---|---|
1 | 10 | 10 |
2 | 20 | 30 |
3 | 30 | 60 |
4 | 40 | 100 |
This result shows the cumulative sum of `SomeNumt` up to each `id`. We have thus found a method of taking a cumulative sum in SQL.
Recap:
- For each `id` in `t1`, the join finds all preceding rows, including itself, in `t2`.
- The `sum` function then calculates the total of `SomeNumt` for these rows.
- The result is grouped by each `id` in `t1`.
- The output shows the cumulative sum of `SomeNumt` for each `id`, effectively showing how the sum accumulates as `id` increases.

While this is a clever solution, it has a downside: the self-join has O(n^2) complexity, so the computational effort grows quadratically with the number of rows. When our table grows to a million rows, we need a different approach!
SELECT
id,
SomeNumt,
SUM(SomeNumt) OVER (ORDER BY id) AS cumulative_sum
FROM
@t
ORDER BY
id;
Optimized Query:
The optimized query above uses the SUM OVER window function: `SUM(SomeNumt) OVER (ORDER BY id)` calculates a running total (cumulative sum) of `SomeNumt` values in the order of `id`. Window functions perform a calculation across a set of table rows that are somehow related to the current row. Unlike aggregate functions, window functions do not cause rows to become grouped into a single output row.
How SUM OVER Works:
- ORDER BY Clause: The `ORDER BY id` inside the `OVER` clause specifies the order in which the rows are processed for the sum.
- `SUM(SomeNumt) OVER (ORDER BY id)` computes the cumulative sum by adding the current row's `SomeNumt` to the sum of all previous rows' `SomeNumt`.
- Efficiency Gains: The window function computes the cumulative sum in a single pass through the data, making it O(n) in complexity, significantly improving performance over the O(n^2) complexity of the self-join.
Here, then, is how we query the cumulative expense over days of the current month.
/*
Get cumulative expense by day of the month for current month.
*/
WITH days AS (
SELECT generate_series(1, 31) AS day_of_month
),
expenses_by_day AS (
SELECT
EXTRACT(DAY FROM dt_incurred) AS day_of_month,
SUM(amount_in_rs) AS daily_expense
FROM item
WHERE DATE_TRUNC('month', dt_incurred) = DATE_TRUNC('month', CURRENT_DATE)
GROUP BY EXTRACT(DAY FROM dt_incurred)
),
cumulative_expenses AS (
SELECT
d.day_of_month,
COALESCE(e.daily_expense, 0) AS daily_expense
FROM
days d
LEFT JOIN expenses_by_day e ON d.day_of_month = e.day_of_month
)
SELECT
day_of_month,
SUM(daily_expense) OVER (ORDER BY day_of_month) AS cum_this_month
FROM
cumulative_expenses
ORDER BY
day_of_month;
Here is how it works:
WITH days AS (
SELECT generate_series(1, 31) AS day_of_month
)
expenses_by_day AS (
SELECT
EXTRACT(DAY FROM dt_incurred) AS day_of_month,
SUM(amount_in_rs) AS daily_expense
FROM item
WHERE DATE_TRUNC('month', dt_incurred) = DATE_TRUNC('month', CURRENT_DATE)
GROUP BY EXTRACT(DAY FROM dt_incurred)
)
This CTE computes the daily expense totals for the current month, grouping by the day extracted from the `dt_incurred` column and summing the `amount_in_rs` values.
cumulative_expenses AS (
SELECT
d.day_of_month,
COALESCE(e.daily_expense, 0) AS daily_expense
FROM
days d
LEFT JOIN expenses_by_day e ON d.day_of_month = e.day_of_month
)
SELECT
day_of_month,
SUM(daily_expense) OVER (ORDER BY day_of_month) AS cum_this_month
FROM
cumulative_expenses
ORDER BY
day_of_month;
The main query builds on this to calculate the cumulative expenses for the previous month in addition to the current month.
SQL is a technology that is easy to underestimate in the Age of AI, but we do so at our own peril. Particularly when building data pipelines, one can use SQL to reshape data into the form suitable for consumption at the end-point – or as close to it as possible. In the use-case of expense reporting, SQL opened the door to compelling visualizations using templates in Grafana.
Looking to elevate your AI skills? Join our FastAI course! We cover not only cutting-edge modeling techniques but also the essential skills for acquiring, managing, and visualizing data. Learn from Ph.D. instructors who guide you through the entire process, from data collection to impactful visualizations, ensuring you’re equipped to tackle real-world challenges.
Building Machine Learning or AI models is just the beginning. Conveying the insights derived from these models is equally, if not more, important. Interactive visualizations are powerful tools for gaining buy-in and influencing decisions. However, creating effective visualizations can often require as much effort as the modeling itself.
Use Grafana in Docker on Raspberry Pi
Grafana offers a low-code or no-code solution for creating dashboards. By connecting queries to your data backend, you can quickly develop visualization templates with live data feeds. Grafana enables you to generate compelling visualizations and craft a narrative that supports your recommended actions. For example, you can use gauges to track monthly spending against a budget and recommend cost-saving measures like eating in instead of dining out.
Follow these steps to set up Grafana on your Raspberry Pi:
Pull the image like so: docker pull grafana/grafana
Give the Grafana container user (UID/GID 472) ownership of the data directory used for persistent storage: chown 472:472 /path/to/grafana/data
Then spin up the container:
docker run --name grafana_service \
--restart always \
-d -p 3000:3000 \
-v /path/to/grafana/data:/var/lib/grafana \
-v /path/to/grafana/provisioning:/etc/grafana/provisioning \
-e GF_SECURITY_ADMIN_USER=admin \
-e GF_SECURITY_ADMIN_PASSWORD=topsecretphrase \
grafana/grafana
This command creates a Docker container named grafana_service with persistent data storage in shared volumes and sets up an admin account.
To share your Grafana dashboards, you can send viewers a link to the dashboard or publish a snapshot, ensuring that your visualizations reach your intended audience.
With Grafana and Docker, setting up and sharing interactive visualizations on your Raspberry Pi has never been easier. Use these tools to create compelling dashboards that effectively communicate your data insights.
Join our FastAI class to master not only the art of building robust AI models but also the essential skills of visualizing results and deploying software seamlessly into the hands of customers. Our course, led by experienced Ph.D. instructors, covers the full spectrum of AI development—from crafting accurate models to creating interactive visualizations with tools like Grafana and managing deployments using Docker on platforms like Raspberry Pi. Gain the comprehensive expertise needed to turn data into actionable insights and deliver real-world applications efficiently.
Docker is addictive. As it becomes an indispensable tool in our toolkit, we need a way to streamline container management. Docker offers Docker Desktop, a GUI application, for Mac and Windows users. For Raspberry Pi, I recommend Portainer.
Portainer provides an intuitive web-based interface to manage Docker containers, making it easy to start, stop, modify, or remove containers and monitor usage statistics. It runs in its own container and is well-suited for single-board computers (SBCs) like the Raspberry Pi running Raspbian.
Here’s how to get Portainer up and running on your Raspberry Pi:
Pull the image like so: docker pull portainer/portainer:linux-arm
Then spin up the container with the docker run command like so:
docker run --name portainer_service --network docker-net \
-d -p 9000:9000 \
-v /var/run/docker.sock:/var/run/docker.sock \
-v ~/Your/path/to/data:/data \
portainer/portainer:linux-arm
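The command above assumes the docker-net user-defined network already exists; if it doesn't, create it first:

# Create the shared network referenced by --network docker-net
docker network create docker-net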
Now access Portainer from: http://localhost:9000. That is it!
Deploying AI Models in Real-World Situations
Being an AI practitioner means not only building models but also being familiar with various environments in which these models are deployed. Field applications often leverage platforms such as SBCs. The Raspberry Pi is a popular SBC that runs Linux and integrates seamlessly with various sensors and actuators.
In our FastAI course, you’ll learn from Ph.D. instructors who have invaluable expertise in both model building and the engineering required to deploy these models in real-world situations. We cover essential tools like Docker and Portainer, equipping you with the skills needed to put your AI applications into the hands of users effectively.
Join us to bridge the gap between developing AI models and deploying them in practical environments.
In an earlier blogpost, we looked at why Docker is such an indispensable tool in the developer or data scientist’s toolkit, especially for shipping code that runs consistently across different environments. The Raspberry Pi, with its ARM architecture, provides a great example. Running software developed in traditional x86 environments on the Raspberry Pi can be challenging. Docker helps surmount these challenges and saves us endless frustration.
I recently built an app for tracking expenses, particularly when my kid spent too much money on Zomato, a food delivery app! The idea was to set a monthly budget and track expenses against it. Visualizing data is crucial for self-adjustment on the path to improvement. The expense tracking system I built allows tracking expenses in common categories such as groceries, restaurants, utilities, etc.
I developed the app in the Python programming language. The app’s GUI, built in PyQt6, supports CRUD operations on a Postgres database. Users can create, view, update, and delete expenses. The architecture is a standard MVC framework with Pydantic business objects and SQLAlchemy ORM.
Additionally, I built a dashboard in Grafana, a browser-based dashboarding tool. I developed the app on my MacBook and deployed it on the “always on” Raspberry Pi 400 at home.
I encountered many hurdles when porting code to the Raspberry Pi. The PyQt6 library couldn’t be installed in a virtual environment using the pip installer, so I installed it system-wide from the Debian repository. This prevented running my Python code in the virtual environment. Consequently, other dependencies like Pydantic and SQLAlchemy also had to be installed system-wide.
I pulled the official Docker image for Postgres to get the database running. See this blogpost. However, finding a suitable SQL client was surprisingly difficult. I tried TablePlus, DBeaver, and pgAdmin, but each had issues post-installation, rendering them unusable.
A bad workman always blames his tools. Was I using the wrong tools or is the Raspberry Pi just Rubbish Pi?
The Raspberry Pi, being resource-constrained compared to a desktop, poses challenges in installing and running software. Docker can be invaluable in such situations. I was already running Postgres in a Docker container on the Raspberry Pi and found ARM-compatible containers for pgAdmin and Grafana. The pgAdmin container, provided by Elestio, is available as elestio/pgadmin.
Here’s how to install it:
docker pull elestio/pgadmin
Run the container:
docker run --name pgadmin_service \
-d -p 8080:8080 \
--network docker-net \
-e PGADMIN_DEFAULT_EMAIL=handsomest.coder@email.com \
-e PGADMIN_DEFAULT_PASSWORD=topsecret \
-e PGADMIN_LISTEN_PORT=8080 \
-v ~/path/to/pgadmin/servers.json:/pgadmin4/servers.json \
elestio/pgadmin
This command starts a pgAdmin container named pgadmin_service in the background, accessible via http://localhost:8080, connected to the docker-net network, using the specified email and password for initial login credentials, and mounting a servers.json file for predefined server connections.
Using Docker images for pgAdmin and Postgres provided a smooth path to setting up these vital data assets on my Raspberry Pi. This approach should generally be the default for installing software in app development. For example, when building home automation systems, it's beneficial to containerize services like Mosquitto MQTT, Node-RED, InfluxDB, and Grafana that are part of the developer's IoT stack. Refer to this YouTube video for a guide. For managing all these containers with a graphical UI, use Portainer, which is available as a free community edition.
Unlock the full potential of AI with our FastAI course, where you’ll not only master AI integration into applications but also gain essential skills like Docker, crucial for deploying and managing your apps in real-world environments. Learn to bridge the gap between development and practical implementation, ensuring your AI solutions are robust, scalable, and ready for end-users. Join us to transform your technical expertise into powerful, customer-ready applications.
Back in the day, when I wrote apps in C/C++, I compiled the code into an executable for shipping. When we code in Python, how do we ship code?
We could simply send our code to the customer to run it on their machine. But the environment in which our code would run at the customer’s end would almost never be identical to ours. Small differences in environment could mean our code doesn’t run and debugging such issues is a colossal waste of time, not to mention repeating the process for every customer.
But there is a better way and that is Docker!
Here is a microservice I built that was part of a larger app. It has a spider that crawls multiple web domains for content that it scrapes and puts into a Postgres warehouse. I built out the spider in Scrapy framework in Python 3 and used the psycopg2 client for database CRUD operations.
Shipping the code means replicating the environment on the machine where it will run. In the process, small differences may creep in. The version of Python or its dependencies may differ. The version of Postgres may also differ. The devil lies in the details! Small differences can throw a spanner in the works. That is why shipping code in this manner is not recommended.
Instead, dockerize the app!
Let’s start by dockerizing the Postgres warehouse. The steps are pulling the docker image from docker hub and then spinning up the container!
Pull the image like so: docker pull postgres
Then spin up the container like so: docker run --name postgres_service --network scrappy-net -e POSTGRES_PASSWORD=topsecretpassword -d -p 5432:5432 -v /Your/path/to/volume:/var/lib/postgresql/data postgres
This command not only launches the container but also connects it to the Docker network for seamless communication among containers. (Refer this blogpost.) Additionally, it ensures data persistence by sharing a folder between the host machine and the container.
Let's break down the `docker run` command into its constituent parts:
- `docker run`: This is the command used to create and start a new container based on a specified image.
- `--name postgres_service`: This flag specifies the name of the container. In this case, the container will be named "postgres_service".
- `--network scrappy-net`: This flag specifies the network that the container should connect to. In this case, the container will connect to the network named "scrappy-net".
- `-e POSTGRES_PASSWORD=topsecretpassword`: This flag sets an environment variable within the container. Specifically, it sets the environment variable POSTGRES_PASSWORD to the value topsecretpassword. This is typically used to configure the containerized application.
- `-d`: This flag tells Docker to run the container in detached mode, meaning it will run in the background and won't occupy the current terminal session.
- `-p 5432:5432`: This flag specifies port mapping, binding port 5432 on the host machine to port 5432 in the container. Port 5432 is the default port used by PostgreSQL, so this allows communication between the host and the PostgreSQL service running inside the container.
- `-v /Your/path/to/volume:/var/lib/postgresql/data`: This flag specifies volume mapping, creating a persistent storage volume for the PostgreSQL data. The format is -v <host-path>:<container-path>. In this case, it maps a directory on the host machine (specified by /Your/path/to/volume) to the directory inside the container where PostgreSQL stores its data (/var/lib/postgresql/data). This ensures that the data persists even if the container is stopped or removed.
- `postgres`: Finally, postgres specifies the Docker image to be used for creating the container. In this case, it indicates that the container will be based on the official PostgreSQL image from Docker Hub.

For creating a container from our own code – Python scripts and dependencies – there are a few steps. The first step is creating a Dockerfile. The Dockerfile for our Scrapy app looks like so:
# Use the official Python 3.9 image
FROM python:3.9
# Set the working directory in the container
WORKDIR /app
# Copy the current directory contents into the container at /app
COPY . /app
# Install required dependencies
RUN pip install --no-cache-dir -r requirements.txt
# Set the entry point command for running the Scrapy spider
ENTRYPOINT ["scrapy", "crawl", "spidermoney"]
This Dockerfile automates the process of building the image from which the container is run. It ensures that the container has all the necessary dependencies to execute the Python app.
Let's break down each line of the Dockerfile:
- `FROM python:3.9`: This instruction specifies the base image to build upon. It tells Docker to pull the Python 3.9 image from the Docker Hub registry. This image will serve as the foundation for our custom image.
- `WORKDIR /app`: This instruction sets the working directory inside the container to /app. This is where subsequent commands will be executed, and it ensures that any files or commands are relative to this directory.
- `COPY . /app`: This instruction copies the contents of the current directory on the host machine (the directory where the Dockerfile is located) into the /app directory within the container. It is a common practice to place the Dockerfile in the project directory at the top level, so that the application code and files are included inside the Docker image.
- `RUN pip install --no-cache-dir -r requirements.txt`: This instruction runs the pip install command inside the container to install the Python dependencies listed in the requirements.txt file. The --no-cache-dir flag ensures that pip doesn't use any cache when installing packages, which can help keep the Docker image smaller.
- `ENTRYPOINT ["scrapy", "crawl", "spidermoney"]`: This instruction sets the default command to be executed when the container starts. It specifies that the scrapy crawl spidermoney command should be run. This command tells Scrapy, a web crawling framework, to execute a spider named "spidermoney". When the container is launched, it will automatically start executing this command, running the Scrapy spider.
The Dockerfile is a recipe. The steps to prepare the dish are as follows:
- Build the image with `docker build -t scrapy-app .`. The Dockerfile is a series of instructions to build a Docker image from. The build process downloads the base layer and adds a layer with every instruction. Thus, layer by layer, a new image is constructed which has everything needed to spin up a container that runs the app.
- Create and start the container with the `docker run` command. For example: `docker run --name scrapy_service --network scrappy-net -e DB_HOST=postgres_service scrapy-app`. This command creates a container named 'scrapy_service' from the image 'scrapy-app' and connects it to the network 'scrappy-net'. The name of the container running Postgres is passed as an environment variable with the -e flag to configure the app to work with the database instance.

Once the microservice is containerized, launching it is as simple as starting the containers. Start the Postgres container first, followed by the app container. This can be easily done from the Docker dashboard.
Verifying Deployment:
Verify the results by examining the Postgres database before and after running the microservice. Running SQL queries can confirm that the spider has successfully crawled web domains and added new records to the database.
The figures show that 1232 records were added in this instance.
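As a sketch of the kind of check involved (your_table is a placeholder for whatever table the spider writes to):

# Count rows in the warehouse before and after a crawl
docker exec -it postgres_service \
    psql -U postgres -c "SELECT COUNT(*) FROM your_table;"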
Now shipping the code is as simple as a `docker push` to post the images to Docker Hub, followed by a `docker pull` on the target machine.
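A minimal sketch of that hand-off, assuming a Docker Hub account named youruser (names and tags are illustrative):

# On the development machine: tag and push the image
docker tag scrapy-app youruser/scrapy-app:latest
docker push youruser/scrapy-app:latest

# On the target machine: pull the image before running it
docker pull youruser/scrapy-app:latest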
With Docker, shipping code becomes a streamlined process. Docker encapsulates applications and their dependencies, ensuring consistency across different environments. By containerizing both the database and the Python app, we simplify deployment and guarantee reproducibility, ultimately saving time and effort.
In conclusion, Docker revolutionizes the way we ship and deploy code, making it an indispensable tool for modern software development.
Being in the AI profession is more than just coding neural networks. Getting them into the hands of customers demands a thorough understanding of contemporary microservices architecture patterns. Learn from experienced instructors who can be your guide through our comprehensive coaching program powered by FastAI. Gain insights into cutting-edge techniques and best practices for building AI applications that not only meet the demands of today’s market but also seamlessly integrate into existing systems. From understanding advanced algorithms to mastering deployment strategies, our Ph.D. instructors will equip you with the skills and knowledge needed to succeed in the dynamic world of AI deployment. Join us and take your AI career to new heights with hands-on training and personalized guidance.
It is vital for a company to continuously monitor the changing business landscape for both threats and opportunities. This critical function involves prospecting opportunities and gathering intelligence on competitors, which is then synthesized by analysts into executive briefs with actionable recommendations. This task entails sifting through a wide array of information from diverse sources such as websites, regulatory filings, social media, and news articles, contributed by journalists, analysts, influencers, regulators, as well as internal company staff and officers. Automation efforts have often focused on casting a wider net, resulting in more pressure on downstream analysis and insight generation where the value lies. Recent rapid developments in Generative AI and the emergence of Large Language Models (LLMs) in Open Source have opened the door to automation of these downstream activities. In particular, the “co-pilot” mode of assistive AI offers the potential to increase productivity and reduce the risk of missed opportunities. We built a chatbot assistant in one such use-case for Bayer Crop Science USA.
The challenges of automating information digestion for insight generation can be distilled into two key problems: retrieving relevant information from a large corpus and using that information to contextualize responses. To address the first challenge, we employed Semantic Search, which allows natural language queries to be posed to a large text corpus, yielding ranked results. For the second challenge, we adopted Retrieval Augmented Generation (RAG), a technique that leverages Semantic Search results to provide transient context to a pre-trained Large Language Model (LLM) like ChatGPT. This approach avoids the computational intensity of fine-tuning LLMs and ensures that responses are guided by recent and relevant information without permanently embedding it into the neural network.
Retrieval Augmented Generation (RAG) utilizes text retrieved by Semantic Search to augment a Large Language Model’s response to a prompt. Semantic Search employs embeddings, which represent text in a vector space. We implemented Semantic Search using the nomic-embed-text model within the ollama framework with Chroma as vector store. We wrapped a Streamlit UI around the vector store to enable search in a “standalone” mode. We used the LangChain framework to pull together the Retrieval Augmented Generation (RAG) workflow, with the Llama2 LLM from Meta with 13B parameters. The user’s prompt is routed to the Semantic Search engine to retrieve relevant documents, which then serve as context for the LLM to use in responding. This approach enhances the LLM’s ability to provide informed responses, effectively supporting the team’s work. The system has been lauded by users at Bayer Crop Science USA, who appreciate its capacity to provide tailored insights and streamline decision-making processes.
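For reference, fetching the two models named above with the ollama CLI looks like this (model tags are indicative and depend on what is available in your ollama registry):

# Embedding model used for Semantic Search
ollama pull nomic-embed-text
# 13B-parameter Llama2 model used for generation
ollama pull llama2:13b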
Empower yourself with the transformative capabilities of Deep Learning AI through our comprehensive coaching program centered on FastAI. Dive deep into the intricacies of AI and emerge equipped with invaluable skills in natural language processing, computer vision, and beyond. Our hands-on approach ensures that learners of all levels, from beginners to seasoned practitioners, grasp complex concepts with ease and confidence. Join us on a journey of discovery and mastery, where cutting-edge knowledge meets practical application, propelling you towards success in the dynamic world of AI.
In the realm of Biotech R&D, the cultivation of genetically engineered plants through tissue culture stands as a pivotal process, deviating from traditional seed-based methods to derive plants from embryos. Particularly in the case of corn, this intricate procedure spans 7-9 weeks, commencing with the manipulation of embryonic tissue, deliberately injured and exposed to agrobacterium tumefaciens, a specific bacteria facilitating DNA transfer. The outcome manifests as plant transformation, marked by the integration of foreign genes into the targeted specimen. Notably, the success rates of this process are dismally low, with a meager 2% or fewer embryos evolving into viable plants boasting the intended genetics. Hence, it becomes imperative to discern the success or failure of plant transformation at the earliest stages.
Historically, this determination was only feasible at the culmination of the 7-9 week period when plantlets emerged. Consequently, more than 98% of non-transformable embryos occupied valuable laboratory space and consumed essential resources. Given that plant transformation transpires within specialized chambers, maintaining stringent environmental conditions (temperature, humidity, and light), the inefficient utilization of space becomes a bottleneck in the downstream biotech R&D pipeline. To address this challenge, we conceptualized and implemented a groundbreaking solution: a Convolutional Neural Network (CNN) designed to scrutinize embryos and identify non-transformable ones within the initial two weeks post the initiation of plant transformation. This computer vision solution revolutionized the traditional approach, facilitating early detection and removal of approximately half of the non-transformable embryos. This, in turn, averted the necessity for a capital expenditure ranging between $10-15 million to expand the facility, effectively enhancing throughput by 1.5 to 2 times. Technologically, our approach incorporated an ensemble of deep learning models, achieving an impressive performance with over 90% sensitivity at 70% specificity during testing.
Leveraging pre-trained models and neural transfer learning, we curated an extensive in-house dataset comprising 15,000 images meticulously labeled by cell biologists. These images, capturing various stages of embryonic development, were acquired using both an ordinary DSLR camera and a proprietary hyperspectral imaging robot. Our experimentation precisely determined the optimal timeframe for image acquisition post the initiation of plant transformation, establishing that images from a conventional DSLR were on par with those from the hyperspectral camera for the classification task. The impact of our work extends far beyond the confines of the laboratory, catalyzing a wave of innovations in computer vision within biotechnology R&D, spanning laboratories, greenhouses, and field applications. This progressive integration has not only optimized the R&D pipeline but has also significantly accelerated time-to-market, positioning our consultancy at the forefront of transformative advancements in the biotech sector.
Interested in the power of deep learning to propel your Python skills to new heights? With our FastAI coaching, you will dive into the world of computer vision and other applications of deep learning. Our expertly crafted course is tailored for those with a minimum of one year of Python programming experience and taught by experienced Ph.D. instructors. FastAI places the transformative magic of deep learning directly into your hands. From day one, you’ll embark on a journey of practical application, building innovative apps and honing your Python proficiency along the way. Don’t just code —immerse yourself in the art and science of deep learning with FastAI.
Suppose I have two Docker containers on a host machine: an app running in one container, requiring the use of a database running in another container. The architecture is as shown in the figure.
In the figure, the Postgres container is named 'postgres_service' and is based on the official Postgres image on Docker Hub. The data reside on a volume shared with the local host. In this way, data are persisted even after the container is removed.
The app container is named 'scrapy_service' and is based on an image built from the official Python 3.9 base image for Linux. The application code implements a web crawler that scrapes financial news websites.
The web scraper puts data into the Postgres database. How does it access Postgres?
On the host machine, the postgres service is accessible at ‘localhost’ on port 5432. However, this will not work from inside the app container where ‘localhost’ is self-referential.
Solution? We create a Docker network and connect both containers to it.
First, create the Docker network and connect the Postgres container to it; inspect and verify. Then spin up the app's container with a connection to the network.
- `docker network create scrappy-net` creates a network named scrappy-net.
- `docker network connect scrappy-net postgres_service` connects the (running) Postgres container to the network.
- `docker network inspect scrappy-net` shows the network and what's on it.

We now have a network ready to accept connections and exchange messages with other containers. Docker will do the DNS lookup using the container name.
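A quick way to confirm that container-name DNS resolution works on the new network (a sketch; it reuses the Postgres password set earlier and the psql client bundled in the official image):

# Run a throwaway client container on the same network and query by container name
docker run --rm --network scrappy-net -e PGPASSWORD=topsecretpassword postgres \
    psql -h postgres_service -U postgres -c 'SELECT 1;'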
- `docker build -t scrapy-app .` builds the image named scrapy-app. The project directory must have the Dockerfile and requirements manifest. The entry point that launches the spider is `scrapy crawl spidermoney` or `scrapy crawl spidermint`.
- `docker run --name scrapy_service --network scrappy-net -e DB_HOST=postgres_service scrapy-app` spins up the container with a connection to the network. It launches the crawler as per the entry-point spec. The container exits when the job is done. Thereafter, it can be re-run as `docker start scrapy_service` with a persistent network connection.
- `docker logs scrapy_service > /Users/sanjaybhatikar/Documents/tempt.txt 2>&1` saves the app's streamed output on stderr and stdout to a temp text file for inspection.