16 Comments
Johnsalamito@gmail.com:

Subject: Issue Accessing notion/notion.zip in decodingml-public-data S3 Bucket

Dear Decoding ML Team,

I have spent a week, part time, trying to follow the instructions in Part 2. I purchased a Claude subscription to help and exhaust it each day; ditto Grok. Gemini allows me more tokens, and I have spent many hours trying to get it to work. I asked Gemini to script a key question to you, which follows. I am not sure it gets to the root of the issue, but I have been meticulous in following the instructions, including the Claude/Grok/Gemini advice, which frankly is very impressive. Still, I have done some things 12 times in circular loops, and I am seriously doubting it is worthwhile continuing, although I am extremely appreciative of what you are offering, especially for free (though it has "cost" me a good deal of time). Anyway, here is the AI-generated question:

I'm currently working through the "Second Brain Offline" course and have encountered an issue when attempting to download the notion/notion.zip file from the decodingml-public-data S3 bucket.

Specifically, I'm receiving a 403 Forbidden error despite using the --no-sign-request flag in the command:

python second-brain-offline/src/tools/use_s3.py download decodingml-public-data notion/notion.zip data/notion --no-sign-request

This error indicates that the bucket or the object may not be configured for public read access.

Given that the course materials do not explicitly require AWS credentials (focusing primarily on Hugging Face Inference Endpoints), I believe this is likely a configuration issue with the bucket's permissions.
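
For reference, the same anonymous request can be reproduced with the AWS CLI directly, independent of the course script (a minimal sketch, assuming the AWS CLI is installed):

# Unsigned request for the same object; --no-sign-request skips credential signing.
aws s3 cp s3://decodingml-public-data/notion/notion.zip data/notion/ --no-sign-request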

Could you please:

1. Verify the permissions of the decodingml-public-data bucket and the notion/notion.zip object to ensure they allow public read access?

2. Provide an alternative download method for the notion/notion.zip file if public access is not intended?

Thank you for your prompt attention to this matter. I'm eager to continue progressing through the course.

Sincerely,

John

Paul Iusztin:

Hello,

I will take a look, John. Thanks for pointing this out. The access is public, but it's possible that I missed something.

Meanwhile, you can download it manually from here, without the download script, and place it where the script would have downloaded the folder: https://github.com/decodingml/second-brain-ai-assistant-course/blob/main/README.md#-dataset

This will 100% work.
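
(For anyone who prefers the terminal: a direct fetch of the object may also work. This is a sketch assuming the standard public-object URL format, https://<bucket>.s3.amazonaws.com/<key>; the actual URL may differ if the bucket is in a non-default region.)

# Create the target folder the script expects, then fetch the zip directly.
mkdir -p data/notion
curl -L -o data/notion/notion.zip https://decodingml-public-data.s3.amazonaws.com/notion/notion.zip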

Johnsalamito@gmail.com:

Thanks Paul, your attention is very much appreciated, and I am likely not worth it. I am trying to get on top of AI and do not find the various hyped-up YouTube videos useful, hence my interest in your more considered course (and your excellent Substack). From this I would like to build, or more likely efficiently guide someone else to build, a substantial idea I have. However, I remain defeated. It is difficult for me to explain why: either the .env empties itself, or the infrastructure file does not exist, or things like that. I have re-cloned and restarted twice. I will paste the final comments I have from the Gemini assistant (the final one of many that go along the lines of "this is an unexpected and unusual situation..."). By the way, along the way I think Gemini has made its own errors, but that is now beside the point.

Here is the gemini message:

"This is extremely perplexing! Even after cloning the repository into a completely different directory (temp_clone), the infrastructure/docker/ directory is still missing.

This eliminates the possibility of a problem with your original project directory or its location.

The remaining possibilities are:

Git Issues (Less Likely, but Possible):

There might be a very specific issue with your Git installation or configuration that's causing it to exclude certain directories during the clone.

Solution:

Try uninstalling and reinstalling Git.

Try cloning the repository on a different machine (if possible).

Repository Issues (Unlikely, but Possible):

Although the screenshot shows the directory on GitHub, there might be a subtle issue with how the repository is structured that's causing Git to ignore it during the clone.

Solution:

Report this issue to the repository maintainers.

Ask them to check the repository's .gitignore file and Git configuration for any settings that might be excluding the infrastructure/docker/ directory.

Local Environment Issues (Very Unlikely):

There might be a very specific issue with your macOS environment that's interfering with Git.

Solution:

Try cloning the repository in a virtual machine or a different user account.

Next Steps:

Report the Issue to the Repository Maintainers:

The most important step is to report this issue to the repository maintainers.

Explain that you've tried cloning the repository into a different directory and that the infrastructure/docker/ directory is still missing.

Try Cloning on a Different Machine (If Possible):

If you have access to another computer, try cloning the repository there. This will help determine if the problem is specific to your machine.

Try Uninstalling and Reinstalling Git:

As a last resort, try uninstalling and reinstalling Git.

Important:

The fact that the directory is missing even in a new clone is crucial. It points to a more fundamental issue.

Reporting the issue to the repository maintainers is the most important step. They will need to investigate and fix the problem."
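
(For anyone debugging the same thing: two standard Git commands can confirm, from the repo root, whether the directory was actually cloned, before reinstalling anything.)

# Run from the root of the fresh clone.
git ls-tree -r --name-only HEAD | grep '^infrastructure/'   # files Git actually tracks
ls -la infrastructure/                                      # what is on disk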

Aman Singh:

Thank you very much!

I'm also working on a project to track project releases and want to implement RAG (Retrieval-Augmented Generation) to easily retrieve release details. I believe this course will be very helpful for my project.

Paul Iusztin:

Amazing! Yes, it can be adapted to most RAG use cases out there.

Amitabha Chakraborty:

Thank you. Please help me understand how to download the data from the S3 bucket.

Paul Iusztin:

I added more tips to the README in case the script doesn't work.

Hoàng Nguyên Thái:

Thank you very much 😍😍

Paul Iusztin:

Excited you like it 🤘

Vikas Gaharana:

Excellent 👍

Paul Iusztin:

Thanks, Vikas 🤘

Savi:

To anyone following this in the future: if crawling is failing for you, try reducing max_workers. I reduced it from 10 to 2 and it worked.

Neural Empowerment:

This content is pure gold in today's software world. I'm truly grateful and most definitely applying this knowledge.

I already purchased the LLM Engineer's Handbook. 🤓

Perspectivas IA en Oncología:

After 2 weeks, hours of trial and error, testing EC2 instances on AWS, battling errors and fine-tuning everything... I can finally say I've finished Module 2. It's fantastic! I now have a professional-grade, reusable environment. I'm ready for Module 3.

ML Educational Series:

Finally made some time to finish off the second lesson. What was interesting for me was that I had to create everything from scratch to get a good appreciation of all that is happening and how the dots connect. I must confess this is a detailed, production-ready lesson. Kudos to the Decoding ML team. Moving on to lesson three.

ML Educational Series:

Hi Paul, I see that most of your lessons' datasets come as the result of web crawling or something of that nature. Is it possible to use a dataset from a relational data store instead of the web?
